Understanding US Presidential Speeches from 1900 till Today

Speech Topics and Semantic Similairty Over Time

Author

Harman Singh

1 Introduction

Much criticism of then-Candidate, now-President, Donald Trump has centered on him being unfit for the Office of President of the United States, or his “unpresidentiality”. Some of this criticism also emerges from Donald Trump’s rhetorical style, which has also been deemed “unpresidential”. But what does “Presidentiality” mean? Are there common traits, character qualities, rhetorical styles, or other elements that are common to US Presidents?

While these are valuable and interesting questions to ask, this memo aims to answer two questions related to these questions in detail; namely, whether there are certain topics that are common to US Presidential speeches over time, and whether US Presidents have given speeches and addresses in a way similar to each other?

2 Data

In order to answer these two questions, I use data on Presidential Speeches and Addresses available publicly from the Miller Center at the University of Virginia. Their data on Presidential Speeches is a collection of addresses and speeches, in text form, given by US Presidents from George Washington’s time till the present presidency. While the collection is not exhaustive, it is extensive and contains over 1,000 speeches within it. The data is available to the public in a JSON file format, encoded with a textual transcription of the speech, the date the speech was given, its corresponding President, and title.

3 Methodology

3.1 Selection of the Data

To account for temporal shifts in American society and the realignment in domestic and foreign policy goals, I decided to limit the data to only include speeches from Presidents from the 20th Century on - as such, beginning from Theodore Roosevelt’s Presidency in 1901 onward. This method allows us to capture the historical trend of topics relevant today without skewing the data towards topics that may have been more relevant pre-20th Century but are less so today. In such, our data gains temporal breadth while maintaining relevance for our current world.

Additionally, only speeches given by the President while they were in office are kept within the dataset, and any speeches given while holding other political offices or while campaigning for the Presidency are removed to maintain consistency across for all Presidents within the dataset.

Finally, only minimal text cleaning operations were applied to the speeches, in order to maintain semantic and contextual coherence of the speeches and to optimize the text available for the topic modelling algorithm.

3.2 Analysis of Topics In Speeches

BERTopic Analysis

The first method I used is BERTopic Analysis to identify which topics were found within the speeches. BERTopic is a machine learning technique that groups similar texts into topics by analyzing patterns in language. It uses advanced language models to understand meaning, then clusters related texts together. This helps identify key themes across large sets of documents, like my speeches, without needing manual categorization.

This graph shows the top five topics that were identified by BERTopic from the speeches, graphed year over year on the x-axis. The y-axis shows the proportion of speeches in each year in which the following topics were identified. Proportion helps account for any offsets within our dataset in which some years had more speeches and others less so.

The graph below follows the same x and y-axis outline as above, only this time is split to show each topic’s prevalence individually.

Visualization of these graphs helps reveal important lessons and information:

The top five topics in US presidential speeches of the 20th and 21st Century concern the topics of:
- Banking and Gold
- Health Care
- War and Peace
- Civil Rights and Racial Discourse
- The Vietnam War
Topics spoken about by US Presidents are contingent upon the changes within the domestic or international political systems.
- For example, discussions on Banking and Gold peaked around the Great Depression, when the US left the Gold Standard, and the 2008 Financial but declines at other times.
- Discussion on Vietnam were not significantly prevalent before the Vietnam War nor were they significantly present after.
The topic of Race and Racial/Civil Equality and War and Peace are consistently recurring over the years in the United States, showing that while both conflicts have persisted at the forefront of discussions of the highest Office in the United States, concerns of international wars and peace are more prevelant and important to US Presidents.

Cosine Similarity

The cosine angle \(\cos(\theta)\) between two vectors/documents \(a\) and \(b\) can be defined as:

\[ \cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|} \] where \({\|a\|}\) and \({\|b\|}\) are the magnitudes of vectors \(a\) and \(b\).

Cosine Similarity with Word Embeddings

Cosine similarity is a technique used to measure how similar two pieces of text are by comparing the angle between them in a multi-dimensional space. When combined with word embeddings — mathematical representations of words that capture their meanings and contexts — this method allows us to compare texts based on their semantic similarity. Word embeddings generated by BERTopic understands the nuances of language beyond simply matching words. This is useful for understanding the meaning of sentences.

This graph shows the cosine similarity score of each Presidential speech using the word embeddings method. A darker square equates to a higher score and a lighter score signifies a lower score. Presidents are listed on both the x and y axes in an ascending chronological order, and scores are compared in comparison to each other. The black squares running up diagonally (left-right) signify a score of 1.0, meaning that square is that of the President in question.

This graph shows that most Presidents, from Calvin Coolidge to Bill Clinton, with the exception of Lyndon B. Johnson spoke about similar content to each other. However, Presidents from George W. Bush onwards have spoken on topics more thematically similar to each other rather than other previous Presidents. It also shows that Donald Trump has the lowest level of similarity to Presidents before him compared to other Presidents, a trend that is not the norm.

Cosine Similarity with TF-IDF

Using TF-IDF (Term Frequency–Inverse Document Frequency) cosine similarity, texts are compared based on how often words appear in them, adjusted for how common those words are across all documents. However, it doesn’t capture deeper meaning — two texts might mean the same thing but use different words, and TF-IDF would consider them as not similar. TF-IDF is advantageous for identifying similar vocabulary, while embedding is better for comparing similar ideas or tones.

This graph shows the cosine similarity score of each Presidential speech using the TF-IDF method. A darker square equates to a higher score and a lighter score signifies a lower score. Presidents are listed on both the x and y axes in an ascending chronological order, and scores are compared in comparison to each other. The black squares running up diagonally (left-right) signify a score of 1.0, meaning that square is that of the President in question.

This graph shows that most Presidents, from Calvin Coolidge to Bill Clinton, with the exception of Gerald Ford spoke using similar vocabulary to each other. However, Presidents from George W. Bush onwards have spoken with similar vocabulary to each other, rather than Presidents that have come before. Intriguingly, Joe Biden has had higher levels of terminological consistency with past Presidents than his other fellow recent Presidents. As both Gerald Ford and Donald Trump seem to have lower cosine scores in comparison with fellow Presidents, running a simple line of code reveals that where Ford has an average similarity score of 0.56, Trump has an average similarity score of 0.62. Thus, Gerald Ford was verbally less similar to other Presidents than Donald Trump.

4 Conclusion

This analysis reveals deeper insights into the nature of the US Presidency, particularly the public speaking aspect of the highest office of the land.

By applying BERTopic Analysis and Cosine Similarity using Word Embedding and TF-IDF, this study reveals three main outcomes.

Firstly, that the proportion of topic prevalence in Presidential speeches is time dependent – it fluctuates in accordance with changes in the political spheres, either global or domestic.

Secondly, both with content and vocabulary, there is a similarity between Presidents from Coolidge until Clinton, after which there is a break and a set of new similarities that begin.

Thirdly and lastly, while Donald Trump’s rhetorical choices have been criticized as the great break from previous Presidents, this is only verifiable in terms of topic/thematic consistency and not in terms of vocabulary.

5 Appendix

This is a technical appendix for the operations performed to create this memo.

Part 1: Loading the Data

# load the file
import json as json

with open('speeches.json', 'r') as file:
  speeches = json.load(file)

#convert to a pandas DataFrame
import pandas as pd
df = pd.json_normalize(speeches)

Part 2: Cleaning and Organizing the Text

# We only want to keep Presidents who start from 1900 on and drop all others
keep_presidents = [
    "Theodore Roosevelt", "William Taft", "Woodrow Wilson",
    "Warren G. Harding", "Calvin Coolidge", "Herbert Hoover",
    "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower",
    "John F. Kennedy", "Lyndon B. Johnson", "Richard M. Nixon",
    "Gerald Ford", "Jimmy Carter", "Ronald Reagan",
    "George H. W. Bush", "Bill Clinton", "George W. Bush",
    "Barack Obama", "Donald Trump", "Joe Biden"
]

df_new = df[df["president"].isin(keep_presidents)]

df_new = df_new.drop(['doc_name', 'title'], axis=1)

# Chronological List
df_new.sort_values("date", inplace=True)
df_new.reset_index(drop=True, inplace=True)

# Define the dates in which the President came to office and when they left 
president_terms = {
    "Theodore Roosevelt": ("1901-09-14", "1909-03-04"),
    "William Taft": ("1909-03-04", "1913-03-04"),
    "Woodrow Wilson": ("1913-03-04", "1921-03-04"),
    "Warren G. Harding": ("1921-03-04", "1923-08-02"),
    "Calvin Coolidge": ("1923-08-02", "1929-03-04"),
    "Herbert Hoover": ("1929-03-04", "1933-03-04"),
    "Franklin D. Roosevelt": ("1933-03-04", "1945-04-12"),
    "Harry S. Truman": ("1945-04-12", "1953-01-20"),
    "Dwight D. Eisenhower": ("1953-01-20", "1961-01-20"),
    "John F. Kennedy": ("1961-01-20", "1963-11-22"),
    "Lyndon B. Johnson": ("1963-11-22", "1969-01-20"),
    "Richard M. Nixon": ("1969-01-20", "1974-08-09"),
    "Gerald Ford": ("1974-08-09", "1977-01-20"),
    "Jimmy Carter": ("1977-01-20", "1981-01-20"),
    "Ronald Reagan": ("1981-01-20", "1989-01-20"),
    "George H. W. Bush": ("1989-01-20", "1993-01-20"),
    "Bill Clinton": ("1993-01-20", "2001-01-20"),
    "George W. Bush": ("2001-01-20", "2009-01-20"),
    "Barack Obama": ("2009-01-20", "2017-01-20"),
    "Donald Trump": ("2017-01-20", "2021-01-20"),
    "Joe Biden": ("2021-01-20", "2025-01-20"),
    "Donald Trump": ("2024-01-20", "2025-04-27"),
}

df_new['date'] = pd.to_datetime(df_new['date'], format='ISO8601', utc=True, errors='coerce')
df_new = df_new.dropna(subset=['date']) # drop any na's
df_new['date'] = df_new['date'].dt.date

# Update dictionary to have start and end in a clean date format
for pres, (start, end) in president_terms.items():
    start_date = pd.to_datetime(start).date()
    end_date = pd.to_datetime(end).date() if end else pd.Timestamp.today().date()
    president_terms[pres] = (start_date, end_date)

# Drop speeches by any president in which they were not actively in office, ensure only presidential speeches in our data
def was_president_at_time(row):
    pres = row['president']
    date = row['date']
    
    if pres in president_terms:
        start, end = president_terms[pres]
        return start <= date <= end
    return False  


df_proper = df_new[df_new.apply(was_president_at_time, axis=1)].reset_index(drop=True)

Part 3: BERTopic Analysis

# Import in all relevant models 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.ldamodel import LdaModel
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import numpy as np

# Makes text of speech into one constant string, no returns or breaks included
def clean_text(text):
    text = text.replace('\n', ' ')
    return text.strip()
df_proper['cleaned_text'] = df_proper['transcript'].apply(clean_text)

# Create chunks of 300 words of the speeches
def chunk_text(text, max_words=300):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]

df_proper['chunks'] = df_proper['cleaned_text'].apply(chunk_text)

# Create separate dataframe of all chunks flattened to analyze
docs_chunked = [chunk for chunks in df_proper['chunks'] for chunk in chunks]

# Add a speech id index
df_proper = df_proper.reset_index(drop=True)  
df_proper['speech_id'] = df_proper.index + 1

# Initialize BERTopic Model and apply it to the dataframe of just chunks 
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import contextlib
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
umap_model = UMAP(random_state=42)

#embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
#vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
  #embedding_model=embedding_model,
  #vectorizer_model=vectorizer_model,   # additional models that could be used included here
  calculate_probabilities=True,
  verbose=False,
  umap_model = umap_model,
  #top_n_words=7,
  #nr_topics="auto",
)

topics, probs = topic_model.fit_transform(docs_chunked)

# Get information on the topics identified by BERTopic, counts for each topic, associated keywords
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

topic_info_simple = topic_info[["Topic", 'Count', "Representation"]]

# Remove spaces and joins keywords together for better visibility
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))

frequency_table = pd.DataFrame(topic_info_simple)

# Create a new Dataframe of just chunked text with requisite column names, fashioned just like our old DF
chunked_data = []

for idx, row in df_proper.iterrows():
    chunks = chunk_text(row['cleaned_text'], max_words=300)
    for chunk in chunks:
        chunked_data.append({
            "original_speech_id": row['speech_id'],
            "president": row['president'],
            "date": row['date'],
            "transcript": chunk
        })


df_chunked = pd.DataFrame(chunked_data)

# Create topic column that attaches associated topics to each chunk
df_chunked['topic'] = topics

# Removes Topic -1 (topics with keywords such as 'the', 'of', 'a', etc) and replaces chunks with the next most possible Topic that fits 
## Ensures that speech chunks all have the best associated topic and not this catch all extra words 
excluded_topics = [-1]

def reassign_topic(topic, prob_row):
    if topic in excluded_topics:
        sorted_indices = np.argsort(prob_row)[::-1]
        for idx in sorted_indices:
            if idx not in excluded_topics:
                return idx
        return topic  
    else:
        return topic

df_chunked["topic"] = [
    reassign_topic(t, p) for t, p in zip(df_chunked["topic"], probs)
]

# Map topic labels next to the Topic for better visualization
# Attaches the year a speech was given to each column 
topic_labels = {
    row["Topic"]: row["Representation"] 
    for _, row in topic_info_simple.iterrows()
}

df_chunked["topic_label"] = df_chunked["topic"].map(topic_labels)
df_chunked['year'] = pd.to_datetime(df_chunked['date']).dt.year

Data Visualization

# Creates a dataframe of the top 5, most present topics
top_5_topics = df_chunked['topic'].value_counts().head(5).index
df_top_5 = df_chunked[df_chunked['topic'].isin(top_5_topics)]

# Count the number of speeches for each given topic by year
df_count_by_year = df_top_5.groupby(['year', 'topic']).size().reset_index(name='count')

df_total_by_year = df_chunked.groupby('year').size().reset_index(name='total')
df_count_by_year = pd.merge(df_count_by_year, df_total_by_year, on='year')

# Create a proportion of the number of speeches with topics counted by number of total speeches
## Accounts for years where there may be less speeches or more 
df_count_by_year['proportion'] = df_count_by_year['count'] / df_count_by_year['total']
df_count_by_year['topic_labels'] = df_count_by_year["topic"].map(topic_labels)

# Create summary labels for each of the following topics for easier visualization on a graph
manual_labels = {
  0: "Vietnam War",
  1: "Health Care",
  7: "Banks, Credit, Gold",
  2: "Peace, Nations, War",
  3: "Rights, Blacks, White",
  
}

df_count_by_year['manual_labels'] = df_count_by_year["topic"].map(manual_labels)

# Create an area graph for the 5 topics together
library(ggplot2)
count_by_year <- reticulate::py$df_count_by_year

ggplot(count_by_year, aes(x = year, y = proportion, fill = manual_labels)) +
  geom_area() +
  theme_minimal() +
  labs(title = "Top 5 Topics in Presidential Speeches Over Time",
       x = "Year",
       y = "Proportion of Speeches",
       fill = "Topic") +
  scale_fill_viridis_d() +
  theme(plot.title = element_text(size = 12,face='bold'),
        legend.position = "bottom",
        legend.text = element_text(size = 6))

# Create line graph for each topic/graph by itself
ggplot(count_by_year, aes(x = year, y = proportion)) +
  geom_line(aes(color = manual_labels)) +
  facet_wrap(~ manual_labels, scales = "free_y") +  
  theme_minimal() +
  labs(title = "Top 5 Topics in Presidential Speeches Over Time",
       x = "Year",
       y = "Proportion of Speeches",
       color = "Topic") +
  scale_color_viridis_d() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 10,face='bold')
  )

Part 4: Cosine Similarity

Word Embedding

# Get embedded topics using the BERTopic model 

topics, probs = topic_model.transform(df_chunked["transcript"].tolist())
embeddings = topic_model._extract_embeddings(df_chunked["transcript"].tolist(), method="document")
  
df_chunked["embedding"] = list(embeddings)

# Group presidential speeches and embeddings by president, get average embedding to get just one for each president 
president_embedding = df_chunked.groupby("president")["embedding"].apply(
    lambda emb_list: np.mean(np.vstack(emb_list), axis=0)
    )

# Apply cosine similarity test to the data from above
from sklearn.metrics.pairwise import cosine_similarity

X = np.vstack(president_embedding.values)
cosine_similarity_matrix = cosine_similarity(X)

# Creates Dataframe of the cosine similarity scores and prints the scores of each president in comparison to each other
presidents = president_embedding.index.tolist()

similarity_df = pd.DataFrame(cosine_similarity_matrix, index=keep_presidents, columns=keep_presidents)

Word Embedding Visualization

# Print a heatmap for easier visualization using the viridis color scale for best visualization
library(reshape2)
library(viridis)


similarity_matrix <- reticulate::py$similarity_df
similarity_matrix <- as.matrix(similarity_matrix)
melted_matrix <- melt(similarity_matrix, varnames = c("president_1", "president_2"))

ggplot(melted_matrix, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +  
  scale_fill_viridis(
    option = "viridis",  # Try "magma", "plasma", or "inferno" for other easily visualizable variants
    direction = -1,     
    limits = c(min(melted_matrix$value), max(melted_matrix$value)) 
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10,face='bold')  

  ) +
  labs(
    x = "President",
    y = "President",
    title = "Word Embedding - Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()

TF-IDF Cosine Similarity

# Import vectorizer and list of stopwords to clean our text
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define function to remove stopwords from a text
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Join all text into a singular string 
presidents_aggregated = df_proper.groupby('president')['cleaned_text'].apply(" ".join).reset_index()

# Cleans the text by removing stopwords
presidents_aggregated['cleaned_text'] = presidents_aggregated['cleaned_text'].apply(remove_stopwords)

# Vectorize the text and save the vectorized text seperately
vectorizer = CountVectorizer()
president_dfm = vectorizer.fit_transform(presidents_aggregated['cleaned_text'])

# Apply cosine similarity test and create a Dataframe 
pres_cosine = cosine_similarity(president_dfm,president_dfm)

similarity_df_2 = pd.DataFrame(
    pres_cosine,
    index=keep_presidents,
    columns=keep_presidents
)

TF-IDF Visualization

# Creates another heatmap in the same style as the previous one, using viridis for ease of visualization
similarity_matrix_2 <- reticulate::py$similarity_df_2
similarity_matrix_2 <- as.matrix(similarity_matrix_2)
melted_matrix_2 <- melt(similarity_matrix_2, varnames = c("president_1", "president_2"))


ggplot(melted_matrix_2, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +  
  scale_fill_viridis(
    option = "viridis",  # Try "magma", "plasma", or "inferno" for other easily visualizable variants
    direction = -1,     
    limits = c(min(melted_matrix_2$value), max(melted_matrix_2$value)) 
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10, face='bold')  

  ) +
  labs(
    x = "President",
    y = "President",
    title = "TF-IDF Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()